Finding large average submatrices in high dimensional data
Abstract
The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data. Software is available at https://genome.unc.edu/las/.
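The trade-off between submatrix size and average value can be illustrated with a short sketch. The abstract does not state the score formula, so the form below is an assumption: for a k × l submatrix with average `avg` in an m × n matrix whose entries are treated as roughly standard normal under the null, the score is the negative log of a Bonferroni-corrected Gaussian tail probability. The function names (`log_binom`, `log_normal_upper_tail`, `las_score`) are hypothetical and are not taken from the LAS software.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_normal_upper_tail(z):
    """log P(Z > z) for a standard normal Z."""
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    if p > 0.0:
        return math.log(p)
    # erfc underflowed (very large z): use the leading asymptotic term instead.
    return -0.5 * z * z - math.log(z * math.sqrt(2.0 * math.pi))


def las_score(m, n, k, l, avg):
    """Assumed Bonferroni-style score for a k x l submatrix of an m x n matrix.

    The Bonferroni factor C(m, k) * C(n, l) counts all k x l submatrices;
    the tail term is the chance that one such submatrix of null N(0, 1)
    entries has an average of at least `avg`.
    """
    log_count = log_binom(m, k) + log_binom(n, l)
    log_tail = log_normal_upper_tail(avg * math.sqrt(k * l))
    return -(log_count + log_tail)


# A 10 x 5 submatrix in a 100 x 50 matrix: a higher average gives a higher score.
print(las_score(100, 50, 10, 5, 2.0) > las_score(100, 50, 10, 5, 1.0))
```

Under this assumed form, enlarging a submatrix raises the Bonferroni penalty while also shrinking the tail probability for a fixed average, which is the size-versus-average trade-off the abstract describes.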
Similar resources
On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix.
We investigate the maximal size of distinguished submatrices of a Gaussian random matrix. Of interest are submatrices whose entries have an average greater than or equal to a positive constant, and submatrices whose entries are well fit by a two-way ANOVA model. We identify size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices. Probabili...
Extracting embedded generalized networks from linear programming problems
If a linear program (LP) possesses a large generalized network (GN) submatrix, this structure can be exploited to decrease solution time. The problems of finding maximum sets of GN constraints and finding maximum embedded GN submatrices are shown to be NP-complete, indicating that reliable, efficient solution of these problems is difficult. Therefore, efficient heuristic algorithms are devel...
Tendency based Subspace Clustering on Gene Expression Data
Microarrays are one of the latest breakthroughs in experimental molecular biology. By monitoring expressions of different genes under different experiments, a large matrix representing the gene expression levels of varying experiments will be produced. To reveal patterns in such matrices, Ben-Dor et al. introduced a probabilistic model to discover the strictly order-preserving submatrix (OPSM) ...
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: With advances in science, knowledge, and technology, we increasingly deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimensional problems, classical methods are not reliable because of a large number of predictor variable...
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...